Coherence-based multivariate analysis of high frequency stock market values
The paper tackles the problem of deriving a topological structure among stock prices from high
frequency historical values. Similar studies using low frequency data have already provided valuable 
insights. However, in those cases data need to be collected for a longer period and then they have 
to be detrended. An effective technique based on averaging a metric function on short subperiods 
of the observation horizon is suggested. Since a standard correlation-based metric is not capable of 
catching dependencies at different time instants, it is not expected to perform the best when dealing 
with high frequency data. Hence, the choice of a more suitable metric is discussed. In particular, a 
coherence-based metric is proposed, for it is able to detect any possible linear relation between two
time series, even at different time instants. The averaging technique is employed to analyze a set
of 100 high volume stocks of the New York Stock Exchange, observed during March 2008. 
I. INTRODUCTION



Deriving information from financial markets is a 
formidable and challenging task. During the last decades 
many approaches have been developed and in recent 
years physicists and engineers have proposed novel
approaches [?]. It has already been proved that
the time series of financial prices are related and in this
respect many correlation analyses have provided useful
contributions and insights [1, ?]. However, these analyses
were carried out using daily (i.e. "low frequency")
data, while in the financial field a very important issue 
of the decision process is to understand and to quantify 
the influences, which occur in a short time range. Hence, 
procedures specifically designed to deal with "high fre- 
quency" data are desirable. Further motivations to pre- 
fer a short period analysis can be found in a reduced pre- 
filtering stage. Indeed, economic processes are naturally 
organized into different time units (days, weeks, months 
and years) and trends and seasonal factors are expected 
to be less relevant at small time scales, i.e. when the 
influence of short time financial phenomena is dominant. 
In scientific literature and especially in the econometrics 
and engineering fields several techniques have been devel- 
oped to derive structured information from the sampled 
data. In particular, the black box approach is probably 
the most common paradigm in the linear framework (see
e.g. [?]). On the other hand, in the "high frequency"
scenario, spectral analysis seems to provide a better
approach to the problem, as it has recently been observed
in [?]. In this paper we pursue a different approach, that
has been recently introduced in [9] and that is based on 
Wiener filtering and frequency analysis. In particular, 
such a method exploits linear modeling techniques to
define dynamical laws describing in a qualitative and
quantitative way the connections among the time series of the
data set. The main points of this approach will be
introduced in Section II, while Section III will be devoted to
the application of the method to the stock price historical 
series, sampled at high frequency. Finally, the results will 
be interpreted in terms of graph theory and the relative 
clusterization will be presented as well. 



II. THEORETICAL FRAMEWORK 

It is known that financial time series are non-stationary 
and that they instead exhibit behaviours affected by
cyclical and non-cyclical trends. Thus, in most situations
low frequency data need to be detrended, that is, stripped
of any deterministic component, so that they can be treated
as stationary processes, which can be studied in
the standard statistical framework. On the other hand, 
high frequency data, if observed over a short period, can 
approximately be considered stationary, since the trend 
components mainly present slow variations. 
In this paper, we suggest a technique to exploit high
frequency data in the multivariate analysis described in [9].
In particular, we will introduce a strategy suitable to
deal with such high frequency data over long time
horizons, i.e. when the trend components cannot be
considered negligible. Moreover, we want to underline that
the method presented in [9] turns out to be the
"dynamical" extension of the procedure originally described in
[1]. In this respect, we find it useful to recall that in [1]
an original criterion to perform multivariate analysis of 
the stock market historical values was introduced. The 
authors were mainly concerned with the problem of deriv- 
ing a hierarchical structure among the stocks and to this 
aim they focused on the mutual influences and similar- 
ities inside the market. In the following, we summarize 
the theoretical background they introduced to develop 
their criterion.

Consider a set of $N$ time series, which are first properly
scaled and then modelled as realizations of stationary
stochastic processes, namely $X_i$, $i = 1, \ldots, N$. For each
possible pair $(X_i, X_j)$ an estimate of the correlation
index $\rho_{ji}$ is computed, along with the quantity

$$d_{ji} := \sqrt{2\,(1 - \rho_{ji})}. \tag{1}$$

In particular, it is worth observing that the function (1)
is a metric. In [1] each process $X_i$ is interpreted as
a node in a graph and $d_{ji}$ as the weight of the arc
between $X_i$ and $X_j$. Then, the Minimum Spanning Tree
(MST) is extracted from the related graph, obtaining a
simple connected structure based on the strongest
similarities that emerged from the correlation analysis. Notably,
this procedure has been successfully exploited to provide
a topological analysis of time series in financial markets
using daily data (see e.g. [?]).
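As an illustration of the procedure recalled above, the following Python sketch computes the distance $d_{ji} = \sqrt{2(1 - \rho_{ji})}$ from sample correlations and extracts the MST from the resulting weight matrix. It is only a minimal sketch under stated assumptions: the function names, the toy series and the use of Prim's algorithm are introduced here for illustration and are not taken from [1].

```python
import math

def correlation_distance(x, y):
    """Distance sqrt(2 * (1 - rho)) between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    rho = sxy / math.sqrt(sxx * syy)
    # Clamp to guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - rho)))

def distance_matrix(series):
    """Symmetric matrix of pairwise distances (the arc weights of the graph)."""
    n = len(series)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = correlation_distance(series[i], series[j])
    return d

def minimum_spanning_tree(d):
    """Prim's algorithm on a dense symmetric matrix; returns edges (i, j, weight)."""
    n = len(d)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest arc leaving the current tree.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: d[e[0]][e[1]],
        )
        edges.append((i, j, d[i][j]))
        in_tree.add(j)
    return edges

# Toy data (hypothetical): series 0 and 1 move together, as do series 2 and 3.
series = [
    [1.0, 2.0, 3.0, 2.5, 4.0, 3.5],
    [1.1, 2.1, 2.9, 2.6, 4.2, 3.4],
    [4.0, 3.0, 3.5, 2.0, 1.0, 1.5],
    [3.9, 3.2, 3.4, 2.1, 0.9, 1.6],
]
tree = minimum_spanning_tree(distance_matrix(series))
for i, j, w in tree:
    print(f"{i} -- {j}  (d = {w:.3f})")
```

With $N$ nodes the tree always has $N - 1$ arcs; in the toy data the two strongly correlated pairs end up directly linked, which is the kind of grouping the MST is meant to expose.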
Recently, the metric (1) has been interpreted in terms of
a modeling procedure [?]. Indeed, let the possible
connections between two nodes (processes) be described by the
simple scalar constant $a_{ji}$. Then, modeling $X_j$ by means
of $X_i$ produces the error $e_{ji}$ defined as $e_{ji} := X_j - a_{ji} X_i$
and, thus, by choosing

$$a_{ji} = \sqrt{\frac{E[X_j^2]}{E[X_i^2]}} \tag{2}$$

it is immediate to observe that

$$d_{ji} = \sqrt{\frac{E[e_{ji}^2]}{E[X_j^2]}}. \tag{3}$$

Therefore, the weight function turns out to be the square
root of the ratio between the "powers" of the
modeling error and the modeled process. Such an
interpretation gives the basis for a deeper comprehension of the
original procedure. Indeed, the distance $d_{ji}$ derives from
the modeling error $e_{ji}$, which is computed exploiting the
constant gain $a_{ji}$ to represent the relationship between
the two processes $X_j$ and $X_i$. Therefore, such a
distance is not suitable to describe analogies more complex
than proportional laws. This reasoning is also confirmed
by the observation that $d_{ji}$ is the result of a correlation
analysis and that $\rho_{ji}$ can only capture the similarities
which occur at the same time instant. For instance, it
cannot properly detect analogies in the presence of delays
or time shifts. Hence, let us provide an ideal example to
stress this important point.

Consider a time-discrete, white and stationary stochastic
process $X_a$ with unitary power, i.e. such that:

$$E[X_a(t)\,X_a(t+\tau)] = \begin{cases} 0 & \text{if } \tau \neq 0 \\ 1 & \text{if } \tau = 0. \end{cases} \tag{4}$$

Further, let $X_b$ be the process obtained from $X_a$ by
introducing a one time unit delay, $X_b(t) = X_a(t-1)$.
According to (1) and because of the whiteness feature (4), it
holds that $d_{ba} = \sqrt{2}$. Then, even though the behaviour of
$X_b$ can be exactly derived from $X_a$, according to (1) the
two processes appear as if they were not related at all. Such
a simple example is useful to underline that the original
procedure introduced in [1] cannot be exploited to catch
information between samples at different time instants.
According to the above reasoning, the correlation-based
analysis turns out to be more suitable when one deals
with data sampled at a low frequency, since in such a
situation small delays go unnoticed. Conversely, at high
frequencies, we expect a correlation-based analysis not to
perform the best. In [9] the authors propose a different
multivariate analysis, which mainly extends the
correlation approach in order to properly handle the presence
of delays, so as to detect similarities even at different time
instants.
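Both points made above can be checked numerically. The sketch below assumes zero-mean processes and replaces expectations with sample averages; the helper names are hypothetical. It first verifies that $\sqrt{2(1 - \rho_{ji})}$ coincides with the ratio $\sqrt{E[e_{ji}^2]/E[X_j^2]}$ when the gain is $a_{ji} = \sqrt{E[X_j^2]/E[X_i^2]}$, and then reproduces the delayed white noise example, for which the distance is close to $\sqrt{2}$ although the two series are identical up to a one-step shift.

```python
import math
import random

rng = random.Random(0)
n = 20000
x_a = [rng.gauss(0.0, 1.0) for _ in range(n)]   # white noise with unit power
x_b = [0.0] + x_a[:-1]                            # X_b(t) = X_a(t - 1)

def second_moment(u, v):
    """Sample estimate of E[u v]; the processes are assumed to be zero-mean."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def rho(u, v):
    """Sample correlation index under the zero-mean assumption."""
    return second_moment(u, v) / math.sqrt(second_moment(u, u) * second_moment(v, v))

def correlation_distance(u, v):
    return math.sqrt(2.0 * (1.0 - rho(u, v)))

def modeling_error_distance(x_j, x_i):
    """sqrt(E[e^2] / E[x_j^2]) with e = x_j - a * x_i and a = sqrt(E[x_j^2] / E[x_i^2])."""
    a_ji = math.sqrt(second_moment(x_j, x_j) / second_moment(x_i, x_i))
    e = [xj - a_ji * xi for xj, xi in zip(x_j, x_i)]
    return math.sqrt(second_moment(e, e) / second_moment(x_j, x_j))

d_ba = correlation_distance(x_b, x_a)
d_model = modeling_error_distance(x_b, x_a)
# Re-aligning the two series by one step recovers a perfect correlation.
rho_aligned = rho(x_b[1:], x_a[:-1])

print(f"d_ba (correlation)    = {d_ba:.4f}")
print(f"d_ba (modeling error) = {d_model:.4f}")
print(f"sqrt(2)               = {math.sqrt(2.0):.4f}")
print(f"rho after re-alignment = {rho_aligned:.4f}")
```

The two distance values agree up to rounding because the identity holds exactly for sample moments as well, while the re-aligned correlation shows that the information is present but invisible to a same-instant comparison.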

The generic time series $X$ can be equivalently represented
by its Z-transform, defined as $X(z) = \sum_{n=-\infty}^{\infty} X(n)\, z^{-n}$.
Exploiting the Z-transformation operator, any linear
dependence, even one involving different time instants,
can be transformed into an algebraic equation in the
variable $z$. Thus, the modeling error can be defined
employing linear dynamical models, instead of considering
a simple scalar constant as in (2). Within this paradigm,
a new metric is defined:



$$d(X_i, X_j) := \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( 1 - C_{X_i X_j}(\omega) \right) d\omega \right]^{1/2}, \tag{5}$$

where $C_{X_i X_j}(\omega)$ is the well-known coherence
function defined in terms of the spectral densities
$\Phi_{X_i}(\omega)$, $\Phi_{X_j}(\omega)$ and $\Phi_{X_j X_i}(\omega)$. The distance (5) should
be more apt to describe relations among high frequency
financial time series, since it is a direct byproduct of a
Wiener identification procedure, i.e. it provides the
optimal minimization error in the sense of the least squares
[?]. Further, it is worth noticing that the coherence
function has already been successfully employed in
econometrics to perform statistical analyses devoted to quantifying the
information shared by two time series [11].
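To make the coherence-based distance concrete, the sketch below estimates it for the delayed white noise pair discussed earlier. For that pair $X_b(z) = z^{-1} X_a(z)$, so in the ideal case the coherence equals 1 at every frequency and the distance is 0. The estimator is an assumption made here for illustration: spectra are obtained by averaging periodograms over non-overlapping segments, the coherence is taken in its standard form $|\Phi_{X_j X_i}|^2 / (\Phi_{X_i} \Phi_{X_j})$, and the integral over frequency is replaced by a mean over the DFT grid. It is not claimed to be the estimator used in [9].

```python
import cmath
import math
import random

def dft(segment):
    """Plain O(L^2) discrete Fourier transform (no external dependencies)."""
    L = len(segment)
    return [
        sum(segment[t] * cmath.exp(-2j * math.pi * k * t / L) for t in range(L))
        for k in range(L)
    ]

def coherence_distance(x, y, seg_len=32):
    """Square root of the frequency average of 1 - C_xy.

    Spectra are estimated by summing periodograms over non-overlapping
    segments; at least two segments are needed, otherwise the estimated
    coherence is identically 1.
    """
    n_seg = len(x) // seg_len
    s_xx = [0.0] * seg_len
    s_yy = [0.0] * seg_len
    s_xy = [0j] * seg_len
    for s in range(n_seg):
        fx = dft(x[s * seg_len:(s + 1) * seg_len])
        fy = dft(y[s * seg_len:(s + 1) * seg_len])
        for k in range(seg_len):
            s_xx[k] += abs(fx[k]) ** 2
            s_yy[k] += abs(fy[k]) ** 2
            s_xy[k] += fy[k] * fx[k].conjugate()
    coh = [abs(s_xy[k]) ** 2 / (s_xx[k] * s_yy[k]) for k in range(seg_len)]
    # Clamp to guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, sum(1.0 - c for c in coh) / seg_len))

rng = random.Random(1)
n = 2048
x_a = [rng.gauss(0.0, 1.0) for _ in range(n)]   # white noise
x_b = [0.0] + x_a[:-1]                            # delayed copy of x_a
x_c = [rng.gauss(0.0, 1.0) for _ in range(n)]   # independent white noise

d_delayed = coherence_distance(x_a, x_b)
d_unrelated = coherence_distance(x_a, x_c)
print(f"delayed copy : d = {d_delayed:.3f}")
print(f"independent  : d = {d_unrelated:.3f}")
```

The delayed copy comes out much closer than the independent series, in contrast with the correlation-based distance. Its distance is not exactly zero because each finite segment of the delayed series misses one sample of the original; longer segments reduce this effect. Library routines such as `scipy.signal.coherence` provide Welch-type estimates of the same quantity.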



III. STOCK MARKET ANALYSIS 

A collection of 100 stocks of the New York Stock
Exchange has been observed for four weeks (twenty
market days), in the period 03/03/2008 - 03/28/2008, sampling
their prices every 2 minutes. The stocks have been chosen
as the 100 stocks with the highest trading volume
according to the Standard & Poor Index on the first day of
observation. An a-priori organization of the market has
been assumed in accordance with the sector and industry
group classification provided by Google Finance®, which
is also the source of our data. We underline that in this
paper we are mainly concerned with sectors, but in some 
cases we have also taken into account the industry group 
classification to refine the results. In the following, we 
introduce a strategy suitable for the application of the 
method presented in the previous section to the high
frequency scenario.

The whole observation horizon spans the month of
March. Hence, the corresponding price series cannot be
considered stationary and the statistical tools cannot be
successfully employed to analyze the raw data. In the
literature a variety of techniques exists for the suppression
of trends and periodic components in non-stationary time
series. However, we want to stress that the application
of such procedures introduces an additional prefiltering
phase, which increases the computational burden.
Conversely, hereafter we present a method that avoids
data prefiltering, thus reducing the amount of
computation.

It is worth observing that the observation horizon is
naturally divided into subperiods, namely weeks and days.
Indeed, due to the pre- and post-market sessions, there
is a discontinuity between the end value of a day and the
opening price of the next one. Moreover, a single market
session can be considered a time period sufficiently
short to assume that the influence of trends and seasonal
factors is negligible. Thus, in our analysis, we
have followed the natural approach of dividing the historical
series into twenty subperiods corresponding to single
days. Then, we have performed the multivariate analysis
for each of these sessions, i.e. we have computed the
correlation-based distance (1) and the coherence-based
one (5) among the stocks. Finally, we have averaged such
distances over the whole observation horizon and the related
results have been exploited to extract the MST,
providing the corresponding market structure. We find it
useful to remark that the computation of the distances
performs best on small data sets and that the averaging
procedure provides the desired rejection of trends
and seasonal components. Notably, a similar idea, even
if more sophisticated, is at the basis of the method developed
in [8] to detrend non-stationary time series.

A comparison of the results obtained using the two metrics is
shown in Figure 1. Every node represents a stock and the
color represents the business sector or industry it belongs
to. Figure 1a refers to the correlation tree. We note that
the stocks are quite satisfactorily grouped according to
their business sectors. However, as foreseen, we observe
a generally increased capacity of grouping stocks according
to their sectors in the coherence tree of Figure 1b. This
means that the modeling approach of [9] has been able to
capture time shifted and dynamical dependencies among
the time series.

We stress that the a-priori classification
in sectors is not a hard fact by itself and we are not trying
to match it exactly. A company could well be categorized
in a sector because of its business, but, at the same
time, could show a behaviour similar to, and explainable
through, the dynamics of other sectors. Actually,
we would be very interested in finding results of this
kind. Indeed, in those very cases, our quantitative analysis
would provide the greatest contribution, detecting in
an objective way something which is "counter-intuitive".
Thus, we just use such a-priori classification as a tool to
check if the final topology makes sense and if, at a general
level, the coherence approach performs better. Despite
this disclaimer, it is worth noting that the Financial
(green tints), Consumer (violet tints), Basic Materials
(yellow), Energy (gray tints) and Transportation (dark
blue) sectors are all perfectly grouped, with no exceptions.
In Figure 1b, we note a subclusterization of the
Financial sector, as well. Such a finer detail cannot
be detected in the correlation-based tree. The Consumer
sector shows another prominent subclusterization into the
Food (plum) and Personal/Healthcare (purple) industries,
while the Energy sector presents an evident subclusterization
into the Oil & Gas (dark gray) and Oil
Well Equipment (light gray) industries. In this case the
correlation approach also shows them clearly. In addition, both
approaches show the close presence of companies of
a different sector and industry, Utilities/Natural Gas
(light brown). The other Utilities/Electricity companies
(dark brown) are, interestingly, a different group.
We also observe a big cluster of companies classified as
Services (light blue tints). The correlation tree is not
equally capable of grouping these companies together.
We have differentiated them into the two industries Retail
and Information Technology using two slightly different
colors, respectively aquamarine and cyan. We also
note the presence of three Services companies which are
isolated from the other ones: V [Verizon], T [AT&T], and
S [Sprint]. All of them are telephone companies. This
might suggest that this industry should show at least
slightly different dynamics from the other service companies.
Note also how the Technology sector (red) is
almost perfectly grouped and how IBM, an IT company,
even though classified as a Services company, is located
in it. Finally, the only two automobile companies, GM
and F [Ford], happen to be linked together. The analysis
of these four weeks of March clearly shows
a taxonomic arrangement of the stocks, even though the
choice of a tree structure might have seemed quite reductive
at first thought.
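The session-wise averaging used in this section can be sketched as follows. The code splits each series into consecutive sessions of equal length, evaluates a distance on every session and averages the results over the sessions. It is a minimal sketch under stated assumptions: the function names and the synthetic data are hypothetical, the scaling step is reduced to removing the session mean, and the correlation-based distance is used only because it is short to write (any metric with the same signature, such as a coherence-based one, could be passed instead).

```python
import math
import random

def correlation_distance(x, y):
    """sqrt(2 * (1 - rho)), with the mean of each input removed (assumed scaling)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return math.sqrt(max(0.0, 2.0 * (1.0 - sxy / math.sqrt(sxx * syy))))

def split_sessions(series, session_len):
    """Cut a series into consecutive, non-overlapping sessions of equal length."""
    return [series[k:k + session_len]
            for k in range(0, len(series) - session_len + 1, session_len)]

def averaged_distance_matrix(prices, session_len, metric):
    """Average the session-by-session distances over the whole horizon."""
    n = len(prices)
    sessions = [split_sessions(p, session_len) for p in prices]
    n_sessions = len(sessions[0])
    avg = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = sum(metric(sessions[i][s], sessions[j][s])
                        for s in range(n_sessions))
            avg[i][j] = avg[j][i] = total / n_sessions
    return avg

# Synthetic example: two series share the same intraday fluctuations but
# drift in opposite directions from one day to the next.
rng = random.Random(2)
n_days, session_len = 5, 50
common = [rng.gauss(0.0, 1.0) for _ in range(n_days * session_len)]
stock_up = [5.0 * (t // session_len) + c for t, c in enumerate(common)]
stock_down = [-5.0 * (t // session_len) + c for t, c in enumerate(common)]

d_whole = correlation_distance(stock_up, stock_down)
d_avg = averaged_distance_matrix(
    [stock_up, stock_down], session_len, correlation_distance)[0][1]
print(f"distance on the whole horizon : {d_whole:.3f}")
print(f"distance averaged over days   : {d_avg:.3f}")
```

On the whole horizon the opposite drifts dominate and the two series look far apart, whereas the session-wise average recovers their common short-term behaviour without any explicit detrending. The averaged matrix would then be passed to an MST routine to obtain the market structure.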



IV. CONCLUSION 

We obtain a structural characterization of stocks using
high frequency data. A tree topology is derived, showing
a strong taxonomic arrangement of the price time
series. Though this property has already been proved
for daily prices, to the best of the authors' knowledge,
a similar analysis has never been carried out using high
frequency data. The analysis of a collection of 100 high
volume stocks of the New York Stock Exchange has been
used to evaluate the results. A metric based on the
coherence function has also been employed to quantify the
"closeness" of two price historical series. It is shown to
perform consistently better than a standard correlation
metric, suggesting the presence of propagative and dynamic
phenomena involved in the considered financial
network. Though their presence cannot be considered
preponderant, they provide information which is not
directly captured by a simple correlation analysis.

FIG. 1: The tree structure obtained using a correlation-based analysis (a) and a coherence-based one (b). Every node
represents a stock and the color represents the business sector it belongs to. The considered sectors are Basic Materials (yellow),
Conglomerates (white), Healthcare (pink), Transportation (dark blue), Technology (red), Capital Goods (orange),
Utilities (brown tints), Consumer (violet tints), Financial (green tints), Energy (gray tints) and Services (light blue tints).
Using the industry classification given by Google, the Financial sector has also been differentiated among Insurance Companies
(light green), Banks (medium green) and Investment Companies (dark green); Services have been divided into Information
Technology (cyan) and Retail (aquamarine); Consumer into Food (plum) and Personal-care (purple); Energy into Oil & Gas (dark
gray) and Well Equipment (light gray); Utilities into Electrical (dark brown) and Natural Gas (light brown). The comparison
between (a) and (b) underlines a better capability of the coherence-based method in grouping stocks not only according to
their sector, but also according to their industry.